Credit Card Fraud Detection

Anomaly detection is a classification process in which rare items, events, or observations in data sets are identified. In this article, we investigate the Credit Card Fraud Detection dataset from Kaggle.com [1].

In this article, we try to improve on the results of our first model by implementing a GPU-based PyTorch model. The standardized data is already available from that previous work, so we simply resume from that point.

Data Correlations

Now, let's take a look at the variance of the features.
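As a rough illustration of this check, the sketch below computes per-feature variances with NumPy. The data here is synthetic and the column scales are made up for the example; it is not the Kaggle dataset. It only shows that features standardized to zero mean and unit standard deviation all end up with a variance of 1.

```python
import numpy as np

# Synthetic stand-in for the feature matrix (assumption: 3 features
# with very different scales; not the real credit card data).
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(1000, 3))

# Standardize each column: subtract the mean, divide by the standard deviation.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Per-feature variance before and after standardization.
raw_variances = raw.var(axis=0)
variances = standardized.var(axis=0)

print("raw variances:         ", raw_variances)
print("standardized variances:", variances)
```

If some features still show a variance far from 1 after this step, that usually points to columns that were skipped during standardization.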

Train and Test Sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
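The following is a minimal sketch of this behavior using scikit-learn's `StratifiedKFold`. The labels are synthetic (an assumed 2% positive rate standing in for the fraud class), and the choice of 5 folds is only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, heavily imbalanced data (assumption: 20 positives in 1000 samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:20] = 1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Record the share of positives in every test fold.
fold_rates = []
for train_idx, test_idx in skf.split(X, y):
    fold_rates.append(y[test_idx].mean())

print("overall positive rate:", y.mean())
print("positive rate per test fold:", fold_rates)
```

Each test fold keeps roughly the same positive rate as the full set, which a plain k-fold split does not guarantee when the positive class is this rare.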

Modeling: PyTorch Multi-layer Perceptron (MLP) for Binary classification

A multi-layer perceptron (MLP) is a class of feedforward artificial neural network (ANN). At each iteration, the algorithm measures the error with the cross-entropy loss, then computes the gradients and updates the model parameters. By the end of this iterative process, the predictions should agree more closely with the true labels, since the loss is lower than it was at the first step.
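The loop described here can be sketched in PyTorch as follows. This is not the model used in this article: the toy data, layer sizes, Adam optimizer, learning rate, and step count are all assumptions chosen to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy data: 200 samples, 4 features, labels from a simple rule.
X = torch.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).long()

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X, y = X.to(device), y.to(device)

# A small MLP with one hidden layer and two output logits (one per class).
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # assumed optimizer

losses = []
for step in range(200):
    optimizer.zero_grad()        # clear gradients from the previous step
    logits = model(X)            # forward pass
    loss = loss_fn(logits, y)    # cross-entropy between logits and labels
    loss.backward()              # compute gradients
    optimizer.step()             # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

`nn.CrossEntropyLoss` expects raw logits and integer class labels, so the network has no softmax layer at the end.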

Setting up Tensor Arrays

Modeling

Fitting the model

Model Performance

After step 25000, neither the accuracy nor the loss improved any further, so about 25000 training steps would have been sufficient.
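One simple way to locate such a plateau in a recorded loss history is sketched below. The helper `last_improving_step` and its `min_delta` threshold are hypothetical names introduced for this example, and the loss values are made up.

```python
def last_improving_step(losses, min_delta=1e-4):
    """Return the index of the last step that lowered the best loss
    by more than min_delta (hypothetical helper, for illustration)."""
    best = float("inf")
    best_step = 0
    for step, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            best_step = step
    return best_step


# Made-up loss history that flattens out after index 3.
history = [1.0, 0.6, 0.4, 0.3, 0.3, 0.3, 0.30001]
plateau_start = last_improving_step(history)
print("last improving step:", plateau_start)
```

Training beyond the returned step does not reduce the loss by a meaningful amount, which is the situation described above.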

Confusion Matrix

We use cross-validation with 20 splits.

We use the following metrics to measure model performance, starting with the confusion matrix: \begin{align} \text{Confusion Matrix} = \begin{bmatrix}T_p & F_p\\ F_n & T_n\end{bmatrix}, \end{align}

where $T_p$, $T_n$, $F_p$, and $F_n$ represent true positive, true negative, false positive, and false negative, respectively.
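As a small illustration, the four counts can be tallied directly from true and predicted labels and arranged in the layout used above. The function name `confusion_counts` and the label lists are made up for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count outcomes and return them as [[Tp, Fp], [Fn, Tn]]."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1      # predicted positive, actually positive
        elif pred == positive:
            fp += 1      # predicted positive, actually negative
        elif truth == positive:
            fn += 1      # predicted negative, actually positive
        else:
            tn += 1      # predicted negative, actually negative
    return [[tp, fp], [fn, tn]]


# Made-up labels for illustration.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
matrix = confusion_counts(y_true, y_pred)
print(matrix)
```

Note that other tools may order the rows and columns differently, so it is worth checking the layout before reading counts off a matrix.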

\begin{align} \text{Precision} &= \frac{T_{p}}{T_{p} + F_{p}},\\ \text{Recall} &= \frac{T_{p}}{T_{p} + F_{n}},\\ \text{F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},\\ \text{Balanced-Accuracy (bACC)} &= \frac{1}{2}\left( \frac{T_{p}}{T_{p} + F_{n}} + \frac{T_{n}}{T_{n} + F_{p}}\right). \end{align}

Accuracy can be a misleading metric for imbalanced data sets. In these cases, balanced accuracy (bACC) [6] is recommended instead: it normalizes the true positive and true negative counts by the number of positive and negative samples, respectively, and averages the two resulting rates.
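To make the difference concrete, the sketch below evaluates the formulas above on hypothetical counts for a classifier that misses most positive cases. The counts are invented for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not something defined in this article.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from the four confusion counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    balanced_accuracy = 0.5 * (recall + specificity)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts: 100 positives among 10000 samples, 90 of them missed.
m = classification_metrics(tp=10, fp=5, fn=90, tn=9895)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the plain accuracy is above 0.99 even though only 10% of the positive cases are found, while the balanced accuracy stays near 0.55 and exposes the weak recall.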


References

  1. Kaggle Dataset: Credit Card Fraud Detection
  2. scikit-learn: classifiers
  3. scikit-learn: Metrics and scoring: quantifying the quality of predictions
  4. Confusion matrix
  5. Multi-layer Perceptron classifier
  6. Mower, Jeffrey P. "PREP-Mt: predictive RNA editor for plant mitochondrial genes." BMC Bioinformatics 6.1 (2005): 1-15.